Expose terminalbench in run-eval workflow by neubig · Pull Request #2360 · OpenHands/software-agent-sdk

neubig · 2026-03-08T15:12:26Z

Summary

add terminalbench to the manual run-eval workflow choices
update the internal run-eval skill docs so agents know the new benchmark option exists
verify the cross-repo dispatch path against the corresponding evaluation/benchmarks feature branches

Details

The smoke run used benchmark=terminalbench, eval_limit=5, sdk_ref=main, eval_branch=openhands/terminalbench-ci-490, and benchmarks_branch=openhands/terminalbench-ci-490.
This PR pairs with matching workflow/input changes in OpenHands/evaluation and benchmark-side Harbor fixes in OpenHands/benchmarks.
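For reference, the smoke-run inputs listed above could be turned into a manual dispatch roughly as follows. This is a minimal sketch, not the command that was actually run: the input names and values come from this description, but the workflow file name `run-eval.yml` and the helper `build_dispatch_command` are assumptions made for illustration. The script only prints the command and does not dispatch anything.

```python
import shlex

# Input names and values are taken from the smoke run described above.
SMOKE_INPUTS = {
    "benchmark": "terminalbench",
    "eval_limit": "5",
    "sdk_ref": "main",
    "eval_branch": "openhands/terminalbench-ci-490",
    "benchmarks_branch": "openhands/terminalbench-ci-490",
}

# ASSUMPTION: the workflow file name is a guess; substitute the real one.
WORKFLOW_FILE = "run-eval.yml"


def build_dispatch_command(inputs, workflow=WORKFLOW_FILE,
                           repo="OpenHands/software-agent-sdk"):
    """Return the argv list for a manual `gh workflow run` dispatch."""
    argv = ["gh", "workflow", "run", workflow, "--repo", repo]
    for key, value in inputs.items():
        # -f passes a key=value string input to the workflow_dispatch event.
        argv += ["-f", f"{key}={value}"]
    return argv


if __name__ == "__main__":
    # Print the command instead of running it, so nothing is dispatched.
    print(shlex.join(build_dispatch_command(SMOKE_INPUTS)))
```

Building the argument list separately from executing it keeps the inputs reviewable before anything is triggered.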

Testing

Triggered workflow dispatch: https://github.com/OpenHands/software-agent-sdk/actions/runs/22823734279
Confirmed downstream evaluation workflow launch: https://github.com/OpenHands/evaluation/actions/runs/22823745521

Evidence

Verification link: View conversation

Follow-up investigation: the previously cited terminalbench smoke run did not complete end-to-end, so this PR is being moved back to draft pending real live-run evidence.

$ gh run view 22823734279 --repo OpenHands/software-agent-sdk --json status,conclusion,displayTitle,url
{"conclusion":"success","displayTitle":"Run Eval (terminalbench) Smoke test for OpenHands/benchmarks#490","status":"completed","url":"https://github.com/OpenHands/software-agent-sdk/actions/runs/22823734279"}

$ gh run view 22823745521 --repo OpenHands/evaluation --json status,conclusion,displayTitle,url
{"conclusion":"success","displayTitle":"Eval Job (terminalbench) Smoke test for OpenHands/benchmarks#490","status":"completed","url":"https://github.com/OpenHands/evaluation/actions/runs/22823745521"}

$ # Datadog pod logs for eval-22823745521-claude-son* service:python
[2026-03-08 15:11:16 UTC] Benchmark: terminalbench
[2026-03-08 15:11:16 UTC] Dispatching terminalbench build for SDK commit: 77c68ccfd7bdffb27be88e8793f76cafc45faf9d
[2026-03-08 15:11:17 UTC] ERROR: Benchmarks build dispatch failed (status 404): {"message":"Not Found","documentation_url":"https://docs.github.com/rest/actions/workflows#create-a-workflow-dispatch-event","status":"404"}
[2026-03-08 15:11:17 UTC] Deleted temporary branch: dispatch-22823745521

The GitHub Actions runs only proved that the workflow dispatch/deploy path was reachable. Datadog shows the orchestration failed before the evaluation phase, so there was no completed benchmark run, no uploaded results archive, and no Slack success notification.
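The gap described above, between a green Actions run and a completed evaluation, can be sketched as a check that needs both signals before a smoke run counts as evidence. This is an illustrative sketch only: the helper names (`workflow_succeeded`, `error_lines`, `smoke_run_is_valid`) are hypothetical, the JSON is the `gh run view` output quoted above, and the pod log lines are copied from the excerpt above with the 404 response body elided.

```python
import json

# `gh run view 22823745521 --json ...` output, as quoted above.
RUN_JSON = (
    '{"conclusion":"success","displayTitle":"Eval Job (terminalbench) '
    'Smoke test for OpenHands/benchmarks#490","status":"completed",'
    '"url":"https://github.com/OpenHands/evaluation/actions/runs/22823745521"}'
)

# Pod log lines from the excerpt above (404 response body elided).
POD_LOG = [
    "[2026-03-08 15:11:16 UTC] Benchmark: terminalbench",
    "[2026-03-08 15:11:16 UTC] Dispatching terminalbench build for SDK commit: "
    "77c68ccfd7bdffb27be88e8793f76cafc45faf9d",
    "[2026-03-08 15:11:17 UTC] ERROR: Benchmarks build dispatch failed (status 404): ...",
    "[2026-03-08 15:11:17 UTC] Deleted temporary branch: dispatch-22823745521",
]


def workflow_succeeded(run_json):
    """True if the Actions run itself completed with conclusion 'success'."""
    run = json.loads(run_json)
    return run["status"] == "completed" and run["conclusion"] == "success"


def error_lines(log_lines):
    """Return the log lines that carry an ERROR marker after the timestamp."""
    return [line for line in log_lines if "] ERROR:" in line]


def smoke_run_is_valid(run_json, log_lines):
    """A run counts as evidence only if Actions passed AND the logs are clean."""
    return workflow_succeeded(run_json) and not error_lines(log_lines)


if __name__ == "__main__":
    print("workflow succeeded:", workflow_succeeded(RUN_JSON))
    print("orchestration errors:", len(error_lines(POD_LOG)))
    print("valid smoke run:", smoke_run_is_valid(RUN_JSON, POD_LOG))
```

Under these inputs the workflow check passes while the combined check fails, which matches the situation described here.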

Likely root cause: OpenHands/evaluation currently derives the benchmark build workflow name as build-{benchmark}-images.yml, which becomes build-terminalbench-images.yml. That workflow file does not exist on OpenHands/benchmarks (including branch openhands/terminalbench-ci-490), so the dispatch returns HTTP 404.
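The naming rule above can be illustrated with a small preflight check that derives the workflow file name and compares it against a list of workflow files before dispatching. This is a sketch under stated assumptions, not the actual OpenHands/evaluation code: `derive_build_workflow` and `preflight` are hypothetical helpers, and the `available` set is a made-up placeholder rather than the real contents of OpenHands/benchmarks.

```python
def derive_build_workflow(benchmark):
    """Apply the naming rule described above: build-{benchmark}-images.yml."""
    return f"build-{benchmark}-images.yml"


def preflight(benchmark, available_workflows):
    """Return (workflow_name, exists) so a caller can fail before dispatching."""
    workflow = derive_build_workflow(benchmark)
    return workflow, workflow in available_workflows


if __name__ == "__main__":
    # HYPOTHETICAL listing, not the real workflow files in OpenHands/benchmarks.
    available = {"build-example-images.yml"}

    workflow, exists = preflight("terminalbench", available)
    if not exists:
        # A dispatch to a workflow file that does not exist returns HTTP 404,
        # so report the problem here instead of during the evaluation job.
        print(f"missing workflow: {workflow}")
```

Failing at this point would surface the mismatch before the temporary dispatch branch is created, rather than in the pod logs afterwards.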

Checklist

CI passing
Tests are minimal and pass
No unnecessary code
Evidence from live run (with conversation link if available)
All review comments resolved
Documentation updated (if applicable)

Co-authored-by: openhands <openhands@all-hands.dev>

github-actions · 2026-03-08T15:12:49Z

API breakage checks (Griffe)

Result: Passed


github-actions · 2026-03-08T15:13:05Z

Agent server REST API breakage checks (OpenAPI)

Result: Failed

Log excerpt (first 1000 characters)

{"asctime": "2026-03-08 15:13:03,092", "levelname": "WARNING", "name": "openhands.agent_server.config", "filename": "config.py", "lineno": 173, "message": "\u26a0\ufe0f OH_SECRET_KEY was not defined. Secrets will not be persisted between restarts."}
::error title=openhands-agent-server REST API::Breaking REST API change detected without MINOR version bump (1.12.0 -> 1.12.0).

Breaking REST API changes detected compared to baseline release:
- the 'file' request property type/format changed from 'string'/'' to 'string'/'binary'
/home/runner/work/software-agent-sdk/software-agent-sdk/.venv/lib/python3.13/site-packages/litellm/llms/custom_httpx/async_client_cleanup.py:66: DeprecationWarning: There is no current event loop
  loop = asyncio.get_event_loop()


all-hands-bot

Taste Rating: 🟢 Good taste

Simple, straightforward configuration change. Adds terminalbench to workflow choices and updates skill docs. Follows existing patterns, tested with smoke run. LGTM! 🚀


all-hands-bot

Taste Rating: 🟢 Good taste

The code changes are simple and correct. Appropriately kept in draft until the downstream workflow exists. The 404 error you documented confirms the integration isn't complete yet.

Verdict: ✅ Code is solid, just needs the full stack to work before merging.

Expose terminalbench in run-eval workflow

341445b

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai bot mentioned this pull request Mar 8, 2026

Make terminalbench work with CI pipeline OpenHands/benchmarks#490

Open

all-hands-bot approved these changes Mar 8, 2026


neubig marked this pull request as draft March 9, 2026 03:07

neubig marked this pull request as ready for review March 9, 2026 17:44

all-hands-bot approved these changes Mar 9, 2026


neubig marked this pull request as draft March 10, 2026 12:50

neubig marked this pull request as ready for review March 19, 2026 08:49

all-hands-bot reviewed Mar 19, 2026


neubig merged commit 84359e3 into main Mar 19, 2026
45 checks passed

neubig deleted the openhands/terminalbench-ci-490 branch March 19, 2026 21:58

neubig merged 1 commit into main from openhands/terminalbench-ci-490
